add check functions #276
Conversation
Codecov Report

Additional details and impacted files:

@@           Coverage Diff            @@
##           main     #276      +/-  ##
=========================================
+ Coverage      0   93.70%   +93.70%
=========================================
  Files         0       23       +23
  Lines         0     1192     +1192
=========================================
+ Hits          0     1117     +1117
- Misses        0       75       +75

☔ View full report in Codecov by Sentry.
This PR adds 2 lines that are uncovered by unit tests; it's a conditional that checks and raises an error if the user has supplied something other than "object" or "source".
src/tape/ensemble.py
Outdated
    idx = self._source.index
else:
    raise ValueError(f"{table} is not one of 'object' or 'source'")
return idx.map_partitions(lambda a: np.all(a[:-1] <= a[1:])).compute().all()
It is not obvious to me how this works. Doesn't map_partitions call the function (in this case the lambda) once per partition? If so, it is only checking that the last element is smaller than or equal to the first one. Wouldn't that leave out the elements in the middle?
From the tests this looks like it works, so I must be missing something. Is there a comment you could add to provide more info?
Thanks for calling this out. It's a vectorized operation that's actually doing an element-wise comparison within the partition. As I was trying to put together an explanation, I realized that this only does the comparison within each partition, not across partitions. I've swapped it out to use the existing `is_monotonic_increasing` property in the Dask/Pandas Index API, and expanded the test case to cover the multi-partition case.
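The gap described above can be shown with a minimal NumPy-only sketch (the partition values here are made up for illustration): each partition is sorted internally, but the boundary between them is out of order, so a per-partition check passes while a global check fails.

```python
import numpy as np

# Two "partitions": each sorted internally, but the boundary is out of order.
parts = [np.array([1, 2, 3]), np.array([0, 4, 5])]

# Per-partition check (what the original lambda did): passes.
per_partition = all(np.all(p[:-1] <= p[1:]) for p in parts)

# Global check (what Index.is_monotonic_increasing provides): fails,
# because 3 at the end of the first partition > 0 at the start of the second.
full = np.concatenate(parts)
globally_sorted = bool(np.all(full[:-1] <= full[1:]))
```

This is why a sortedness check on a partitioned index needs a cross-partition comparison, not just a reduction within each partition.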
src/tape/ensemble.py
Outdated
counts = idx.map_partitions(lambda a: Counter(a.unique())).compute()

unq_counter = counts[0]
for i in range(len(counts) - 1):
might be clearer as `range(1, len(counts))` and then `unq_counter += counts[i]`
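The suggested loop shape can be sketched as follows. The counter values here are made up, and the overlap check at the end is my reading of what a cohesion test would do with the merged counters, not necessarily the PR's exact logic:

```python
from collections import Counter

# Per-partition counters of unique index values, as produced by the
# map_partitions call in the diff above (values here are hypothetical).
counts = [Counter({10: 1, 11: 1}), Counter({11: 1, 12: 1})]

# Reviewer's suggested shape: start from the first counter, fold in the rest.
unq_counter = counts[0]
for i in range(1, len(counts)):
    unq_counter += counts[i]

# An index value with a merged count > 1 appears in more than one
# partition, which is what a cohesion check would want to flag.
split_ids = [k for k, v in unq_counter.items() if v > 1]
```

Starting the range at 1 avoids the off-by-one bookkeeping of indexing `counts[i + 1]` inside a `range(len(counts) - 1)` loop.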
Added
This PR addresses #275, adding two functions (`check_sorted` and `check_lightcurve_cohesion`) that assess whether the ensemble data is sorted and whether the source data is co-located within single partitions (referred to here as "cohesion"). There's a bit of a snowball effect to implementing these, so this also ends up addressing #242 by adding a `sort` flag to loader calls that defaults to `False`. This allows the Ensemble to actually be unsorted, making the `check_sorted` function useful. The `sort` flag was added to Dask in 2023.6.1, so I've also updated the minimum version in `pyproject.toml`. This broke something in `ensemble.insert_sources`, where a small tweak is needed to the ColumnMapper, and sorting needs to be on for it and potentially for any upstream data loading. This may be confusing to users, but I also see this function as mainly for devs, so it's probably unlikely that users will need it much.